Add per-module memory regression baseline check to CI Host by lukemelia · Pull Request #4430 · cardstack/boxel

lukemelia · 2026-04-16T23:53:10Z

Summary

- Add per-module memory probes (MEMPROBE_FILE) to host tests via QUnit suiteStart/suiteEnd hooks, logging heap usage and delta for each top-level test module
- Add __shard_warmup__ synthetic module that runs first on every shard, absorbing ~36MB of shared boot cost so real test modules report clean per-file deltas
- Add CI pipeline to extract per-shard memory reports, compare against a committed baseline (memory-baseline.json), and flag regressions with tiered thresholds:
  - Warn: >10% increase or +5MB (whichever is greater)
  - Fail: >100% increase or +50MB (whichever is greater)
- Auto-update baseline on main merge when all tests pass
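
For readers unfamiliar with the "whichever is greater" rule, here is a minimal sketch of how a tiered check along these lines could work. The constant names mirror those visible in the check-memory-baseline.mjs diff excerpt and the values come from the list above; `classifyDelta` and the use of `Math.abs` for negative baselines are assumptions made for illustration, not the script's actual code.

```javascript
// Threshold values from the PR summary; constant names follow the diff excerpt.
const SOFT_RELATIVE = 0.1; // 10%
const SOFT_ABSOLUTE_MB = 5;
const HARD_RELATIVE = 1.0; // 100%
const HARD_ABSOLUTE_MB = 50;

// Hypothetical helper: classify one module's heap delta against its baseline.
function classifyDelta(baselineMB, currentMB) {
  const diff = currentMB - baselineMB;
  // "Whichever is greater": the allowed increase is the larger of the
  // absolute limit and the relative limit scaled by the baseline.
  // Math.abs guards against negative baseline deltas (an assumption).
  const softLimit = Math.max(SOFT_ABSOLUTE_MB, Math.abs(baselineMB) * SOFT_RELATIVE);
  const hardLimit = Math.max(HARD_ABSOLUTE_MB, Math.abs(baselineMB) * HARD_RELATIVE);
  if (diff > hardLimit) return 'fail';
  if (diff > softLimit) return 'warn';
  return 'ok';
}
```

Under this reading, a small module must grow by more than 5MB before it warns, while a 100MB module must grow by more than 10MB, so tiny modules do not trip the percentage rule on noise alone.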

How it works

- Each shard extracts MEMPROBE_FILE lines from test output into a JSON artifact
- The merge-reports job downloads all 20 shard reports and runs check-memory-baseline.mjs
- Results appear in GITHUB_STEP_SUMMARY; hard failures block the PR
- On main push, update-memory-baseline.mjs regenerates and commits the baseline
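
The aggregation step could look roughly like the sketch below. It assumes each shard's JSON artifact is a flat object mapping module name to heap delta in MB, which the PR does not spell out; `mergeShardReports` is a hypothetical name, and skipping `__shard_warmup__` is an assumption based on that module running on every shard.

```javascript
// Hypothetical merge of per-shard reports into one module -> deltaMB map.
// Assumed input shape: [{ "module name": deltaMB, ... }, ...], one per shard.
function mergeShardReports(shardReports) {
  const merged = {};
  const duplicates = [];
  for (const report of shardReports) {
    for (const [mod, deltaMB] of Object.entries(report)) {
      // The warmup module appears on every shard, so it is not a
      // per-file measurement and is left out here (assumption).
      if (mod === '__shard_warmup__') continue;
      // A real module should land on exactly one shard; record surprises.
      if (Object.hasOwn(merged, mod)) duplicates.push(mod);
      merged[mod] = deltaMB;
    }
  }
  return { merged, duplicates };
}
```

Reporting duplicates rather than silently overwriting makes it easier to notice if sharding ever stops being deterministic.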

Noise validation

Ran 3 identical CI runs on the same SHA to measure reproducibility:

- Median per-module delta spread: 0.0 MB across all size buckets
- p90 spread: 0.7 MB; max spread: 39.6 MB (one outlier)
- Sharding is 100% deterministic (same modules, same order, all 3 runs)
- A 10% threshold produces essentially zero false positives on unchanged code
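
As a rough illustration of how spread figures like these can be derived, the sketch below takes one report per CI run and computes the per-module spread (max minus min across runs), then summarizes it. The input shape, the `spreadStats` and `percentile` names, and the nearest-rank percentile method are all assumptions; the PR does not say how its numbers were calculated.

```javascript
// Nearest-rank percentile over an array that is already sorted ascending.
function percentile(sorted, p) {
  if (sorted.length === 0) return 0;
  const rank = Math.ceil(p * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// runs: one { module: deltaMB } object per CI run on the same SHA (assumed shape).
function spreadStats(runs) {
  // Only compare modules that were measured in every run.
  const modules = Object.keys(runs[0]).filter((mod) =>
    runs.every((run) => typeof run[mod] === 'number'),
  );
  const spreads = modules
    .map((mod) => {
      const values = runs.map((run) => run[mod]);
      return Math.max(...values) - Math.min(...values);
    })
    .sort((a, b) => a - b);
  return {
    median: percentile(spreads, 0.5),
    p90: percentile(spreads, 0.9),
    max: spreads.length > 0 ? spreads[spreads.length - 1] : 0,
  };
}
```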

New files

- packages/host/scripts/extract-memory-report.mjs: per-shard log parser
- packages/host/scripts/check-memory-baseline.mjs: baseline comparison
- packages/host/scripts/update-memory-baseline.mjs: baseline regeneration
- packages/host/memory-baseline.json: initial baseline (170 modules, median of 3 runs)
- packages/host/tests/helpers/shard-warmup.ts: synthetic warmup module

Test plan

- Validated full pipeline via workflow_dispatch run 24538269499
- All 20 memory report artifacts uploaded
- Check step correctly identified 1 soft warning, 0 hard failures
- Baseline update correctly skipped (not a main push)
- Verify baseline auto-update works on merge to main

🤖 Generated with Claude Code

- Add per-module memory probes via QUnit suiteStart/suiteEnd in setup-qunit.js that log heap delta for each top-level test module (MEMPROBE_FILE lines). Uses double-GC at module boundaries for accurate snapshots.
- Add __shard_warmup__ synthetic module (shard-warmup.ts) that runs first on every shard to absorb shared boot cost (~36MB), giving real test modules clean per-file deltas independent of shard position.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
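
A minimal sketch of what such a probe might look like, for orientation only. `QUnit.on('suiteStart' | 'suiteEnd', …)` is QUnit's reporter event API, and the suite data it passes includes `name` and a `fullName` array. Everything else here is assumed: the `installMemProbe` name, the injected `readHeapBytes`/`gc`/`log` functions, and the exact MEMPROBE_FILE line layout are illustrative and not taken from setup-qunit.js.

```javascript
// Hypothetical probe installer. Dependencies are injected so the sketch
// does not assume a particular browser or launcher configuration.
function installMemProbe(qunit, { readHeapBytes, gc, log }) {
  const startBytes = new Map();
  const toMB = (bytes) => (bytes / (1024 * 1024)).toFixed(1);
  // Two GC passes in a row, per the "double-GC" note in the commit message.
  const settle = () => {
    gc();
    gc();
  };
  // Only top-level modules are measured; nested suites have longer fullName.
  const isTopLevel = (suite) => suite.fullName.length === 1;

  qunit.on('suiteStart', (suite) => {
    if (!isTopLevel(suite)) return;
    settle();
    startBytes.set(suite.name, readHeapBytes());
  });

  qunit.on('suiteEnd', (suite) => {
    if (!isTopLevel(suite) || !startBytes.has(suite.name)) return;
    settle();
    const end = readHeapBytes();
    const delta = end - startBytes.get(suite.name);
    startBytes.delete(suite.name);
    // Assumed line layout; the real probe's format may differ.
    log(`MEMPROBE_FILE module="${suite.name}" heap_mb=${toMB(end)} delta_mb=${toMB(delta)}`);
  });
}
```

In Chrome, `readHeapBytes` could plausibly be `() => performance.memory.usedJSHeapSize` and `gc` could be `() => window.gc()` when the browser is launched with `--js-flags=--expose-gc`; whether the PR does exactly this is not shown in the excerpts.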

Per-shard memory reports are extracted from MEMPROBE_FILE test output and uploaded as artifacts. The merge-reports job aggregates them and compares against a committed baseline (packages/host/memory-baseline.json).

Tiered thresholds:
- Warn: >10% increase or +5MB (whichever is greater)
- Fail: >100% increase or +50MB (whichever is greater)

On main merge (all tests green), the baseline auto-updates so it tracks the current state. On PRs, regressions are flagged in GITHUB_STEP_SUMMARY.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-17T00:02:53Z

Preview deployments

Copilot

Pull request overview

Adds a CI-level per-module memory regression check for Host tests by emitting per-module heap deltas during QUnit runs, aggregating shard reports, and comparing results against a committed baseline to warn/fail on regressions.

Changes:

Add a synthetic __shard_warmup__ test module to absorb shard boot cost before real modules run.
Add QUnit suiteStart/suiteEnd hooks to log per-top-level-module heap usage deltas (MEMPROBE_FILE ...).
Add CI steps and Node scripts to extract per-shard reports, compare to memory-baseline.json, and (on main) auto-update the baseline.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

| File | Description |
|------|-------------|
| packages/host/tests/test-helper.js | Side-effect import to register the warmup QUnit module before test partitions load. |
| packages/host/tests/helpers/shard-warmup.ts | New synthetic warmup module that primes common runtime/setup work. |
| packages/host/tests/helpers/setup-qunit.js | Adds per-top-level-module heap delta logging via QUnit suite hooks. |
| packages/host/scripts/extract-memory-report.mjs | Parses test output and emits a per-shard JSON memory report artifact. |
| packages/host/scripts/check-memory-baseline.mjs | Compares merged reports against a committed baseline and emits CI summary + exit code. |
| packages/host/scripts/update-memory-baseline.mjs | Regenerates `memory-baseline.json` from shard reports (for main-branch updates). |
| packages/host/memory-baseline.json | Introduces initial committed baseline for per-module memory deltas. |
| .github/workflows/ci-host.yaml | Uploads memory artifacts per shard, merges them, checks baseline, and updates baseline on `main`. |

Copilot · 2026-04-17T00:04:56Z

+  contents: write
  id-token: write
  pull-requests: write


Workflow-level permissions grants contents: write for all events (including pull_request). That unnecessarily broad scope increases the blast radius of any workflow change in a PR. Consider keeping workflow default at contents: read and granting contents: write only at the job (or step) that performs the baseline commit/push, gated to push on main.

Suggested change

-  contents: write
-  id-token: write
-  pull-requests: write
+  contents: read
+  id-token: write
+  pull-requests: read

Copilot · 2026-04-17T00:04:56Z

+const JSON_ENVELOPE_RE = /\{"type":"log","text":"(.*?)"\}\s*$/;
+
+const log = readFileSync(inputPath, 'utf8');
+const report = {};
+
+for (const rawLine of log.split('\n')) {
+  if (!rawLine.includes('MEMPROBE_FILE')) continue;
+
+  let line = rawLine;
+
+  // Unwrap testem JSON envelope if present
+  const envMatch = line.match(JSON_ENVELOPE_RE);
+  if (envMatch) {
+    try {
+      line = JSON.parse(`"${envMatch[1]}"`);
+    } catch {
+      // fall through to raw parse
+    }
+  }
+


The testem JSON envelope unwrapping is fragile: JSON_ENVELOPE_RE will stop at the first " inside the text field (which is very likely here because MEMPROBE_FILE logs include a quoted module= value), causing the unwrap to fail and the probe line to be skipped. A more robust approach is to JSON.parse(rawLine) when it looks like a JSON object and read .text from { type: "log" }, falling back to the raw line otherwise.

Suggested change

-const JSON_ENVELOPE_RE = /\{"type":"log","text":"(.*?)"\}\s*$/;
-
-const log = readFileSync(inputPath, 'utf8');
-const report = {};
-
-for (const rawLine of log.split('\n')) {
-  if (!rawLine.includes('MEMPROBE_FILE')) continue;
-
-  let line = rawLine;
-
-  // Unwrap testem JSON envelope if present
-  const envMatch = line.match(JSON_ENVELOPE_RE);
-  if (envMatch) {
-    try {
-      line = JSON.parse(`"${envMatch[1]}"`);
-    } catch {
-      // fall through to raw parse
-    }
-  }
+function unwrapTestemLogLine(rawLine) {
+  const trimmed = rawLine.trim();
+  if (!trimmed.startsWith('{') || !trimmed.endsWith('}')) {
+    return rawLine;
+  }
+  try {
+    const parsed = JSON.parse(trimmed);
+    if (
+      parsed &&
+      parsed.type === 'log' &&
+      typeof parsed.text === 'string'
+    ) {
+      return parsed.text;
+    }
+  } catch {
+    // fall through to raw parse
+  }
+  return rawLine;
+}
+
+const log = readFileSync(inputPath, 'utf8');
+const report = {};
+
+for (const rawLine of log.split('\n')) {
+  if (!rawLine.includes('MEMPROBE_FILE')) continue;
+
+  const line = unwrapTestemLogLine(rawLine);

Copilot · 2026-04-17T00:04:56Z

+if (failures.length > 0) {
+  lines.push(`### Failures (>${HARD_RELATIVE * 100}% increase or +${HARD_ABSOLUTE_MB}MB)\n`);
+  lines.push('| Module | Baseline | Current | Change |');
+  lines.push('|--------|----------|---------|--------|');
+  for (const f of failures.sort((a, b) => b.diff - a.diff)) {
+    lines.push(
+      `| ${f.mod} | ${f.baseline.toFixed(1)} MB | ${f.current.toFixed(1)} MB | +${f.diff.toFixed(1)} MB (+${f.pct}%) |`,
+    );
+  }
+  lines.push('');
+}
+
+if (warnings.length > 0) {
+  lines.push(`### Warnings (>${SOFT_RELATIVE * 100}% + ${SOFT_ABSOLUTE_MB}MB increase)\n`);


The headings in the step summary output don’t match the implemented thresholds. The code uses Math.max(absolute, relative) (i.e. “whichever is greater”), but the strings currently read like “or” / “10% + 5MB”, which can mislead readers about when a module will actually warn/fail. Update the summary text to reflect the “whichever is greater” behavior (or adjust the threshold logic to match the wording).
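
One way to keep the wording and the logic in step is to generate the heading from the same constants with phrasing that states the "greater of" rule directly. The sketch below is illustrative only; `thresholdHeading` is a hypothetical helper and the exact wording is a suggestion, not text from the PR.

```javascript
// Hypothetical helper: build a summary heading that matches a
// Math.max(absolute, relative) threshold rule.
function thresholdHeading(label, relative, absoluteMB) {
  // Round to avoid floating-point noise such as 10.000000000000002.
  const pct = Math.round(relative * 100);
  return `### ${label} (increase above the greater of ${pct}% or ${absoluteMB}MB)`;
}
```

Calling it with the warn tier's values (0.1 and 5) and the fail tier's values (1.0 and 50) would give both headings the same structure, so readers can compare the two tiers at a glance.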

github-actions · 2026-04-17T00:23:03Z

Realm Server Test Results

1 files ±0 1 suites ±0 14m 3s ⏱️ +17s
894 tests ±0 894 ✅ ±0 0 💤 ±0 0 ❌ ±0
966 runs ±0 966 ✅ ±0 0 💤 ±0 0 ❌ ±0

Results for commit 3a2bd45. ± Comparison against base commit 9282ace.

♻️ This comment has been updated with latest results.

github-actions · 2026-04-17T00:23:41Z

Host Test Results

1 files + 1 1 suites +1 2h 13m 26s ⏱️ + 2h 13m 26s
2 259 tests +2 259 2 244 ✅ +2 244 15 💤 +15 0 ❌ ±0
2 278 runs +2 278 2 263 ✅ +2 263 15 💤 +15 0 ❌ ±0

Results for commit 3a2bd45. ± Comparison against base commit 9282ace.

♻️ This comment has been updated with latest results.

backspace · 2026-04-17T02:29:36Z

> Add CI pipeline to extract per-shard memory reports, compare against a committed baseline (memory-baseline.json), and flag regressions with tiered thresholds:

I see the baseline file in this PR, but it doesn’t seem like the comparison or flagging is happening. Is this still in progress, or planned for a follow-up?

ylm and others added 2 commits April 16, 2026 18:52

lukemelia requested review from backspace and Copilot April 16, 2026 23:56

Copilot started reviewing on behalf of lukemelia April 16, 2026 23:59

backspace added a commit that referenced this pull request Apr 17, 2026

host: Add deliberate memory leak to exercise #4430

a07344f

Fix import ordering lint errors in shard-warmup.ts

3a2bd45

lukemelia wants to merge 3 commits into main from memprobe-per-file-experiment
